Code
library(dplyr) # dataframe tools
library(Matrix) # sparse matrices
library(Seurat) # scRNA-seq tools
library(ggplot2) # plots
library(reticulate) # Python interfaceJanuary 21, 2024
The need to integrate data from scRNA-seq samples into one harmonized dataset has increased in recent years as single cell sequencing has gotten cheaper, making it easier to collect data from multiple subjects, timepoints, or conditions. The goal of most integration techniques is to create an embedding that is (relatively) free of batch effects and that does not lead to the cells clustering by subject ID. There are many methods available to perform integration in both R & Python, and evaluating which method is “best” for your dataset can be tricky & subjective. In this tutorial we’ll try out several of the integration methods available in the scanpy package and compare their results. We’ll use a SmartSeq2 dataset containing myeloid cells from the developing human fetal liver that was originally analyzed in Popescu et al (2020). As such, our goal will be to determine which integration method results in the best embedding for downstream trajectory analysis. In developmental biology settings, a “good” scRNA-seq integrated embedding is smoothly connected (i.e., no spaced-out, discrete clusters), and places celltypes in roughly the order expected based on known biology.
/Users/jack/Desktop/PhD/Research/Python_Envs/personal_site/lib/python3.11/site-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.
self.seed = seed
/Users/jack/Desktop/PhD/Research/Python_Envs/personal_site/lib/python3.11/site-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.
self.dl_pin_memory_gpu_training = (
import warnings # filter out warnings
import numpy as np # matrix utilities
import scanpy as sc # scRNA-seq processing
import pandas as pd # dataframe tools
import anndata as ad # scRNA-seq data structures
import matplotlib.pyplot as plt # plot utilities
from scipy.sparse import csr_matrix # sparse matricesmatplotlibHere we define a theme for matplotlib that mostly matches ggplot2::theme_classic().
base_size = 12
plt.rcParams.update({
# font
'font.size': base_size,
'font.weight': 'normal',
# figure
'figure.dpi': 300,
'figure.edgecolor': 'white',
'figure.facecolor': 'white',
'figure.figsize': (6, 4),
'figure.constrained_layout.use': True,
# axes
'axes.edgecolor': 'black',
'axes.grid': False,
'axes.labelpad': 2.75,
'axes.labelsize': base_size * 0.8,
'axes.linewidth': 1.5,
'axes.spines.right': False,
'axes.spines.top': False,
'axes.titlelocation': 'left',
'axes.titlepad': 11,
'axes.titlesize': base_size,
'axes.titleweight': 'normal',
'axes.xmargin': 0.1,
'axes.ymargin': 0.1,
# legend
'legend.borderaxespad': 1,
'legend.borderpad': 0.5,
'legend.columnspacing': 2,
'legend.fontsize': base_size * 0.8,
'legend.frameon': False,
'legend.handleheight': 1,
'legend.handlelength': 1.2,
'legend.labelspacing': 1,
'legend.title_fontsize': base_size,
'legend.markerscale': 1.25
})We’ll start by reading in the SmartSeq2 scRNA-seq data from Popescu et al, which we downloaded from the Human Developmental Cell Atlas portal. Some massaging of the data is necessary to get it ready to pass into Python.
seu_blood <- readRDS("../../datasets/fetal_liver_SS2.RDS")
class(seu_blood) <- "Seurat"
blood_counts <- as.matrix(seu_blood@raw.data)
blood_counts <- blood_counts[, colnames(seu_blood@scale.data)]
cell_metadata <- seu_blood@meta.data[colnames(seu_blood@scale.data), ] %>%
mutate(cell = rownames(.),
.before = 1)
gene_metadata <- data.frame(gene = rownames(blood_counts))Using the reticulate R package we transfer our raw counts matrix and cell & gene metadata into Python, then create an AnnData object. Lastly, we rename some cell metadata features and subset to just the macrophage development lineage. This will simplify our trajectory structure and make choosing a good integration / embedding combination easier.
blood_counts = csr_matrix(r.blood_counts.transpose())
cell_metadata = r.cell_metadata
gene_metadata = r.gene_metadata
ad_blood = ad.AnnData(blood_counts)
ad_blood.obs_names = cell_metadata['cell']
ad_blood.var_names = gene_metadata['gene']
ad_blood.obs = cell_metadata
ad_blood.obs.rename(columns={'cell.labels': 'celltype', 'fetal.ids': 'fetal_ID', 'percent.mito': 'percent_MT', 'sort.ids': 'sort_ID'}, inplace=True)
ad_blood = ad_blood[ad_blood.obs['celltype'].isin(['HSC_MPP', 'Neutrophil-myeloid progenitor', 'Monocyte precursor', 'Monocyte', 'Mono-Mac', 'Kupffer Cell'])]
ad_blood.obs['celltype'] = (
ad_blood.obs['celltype']
.map(lambda x: {'HSC_MPP': 'HSC', 'Mono-Mac': 'Monocyte-macrophage', 'Kupffer Cell': 'Kupffer cell'}.get(x, x))
.astype('category')
)
ad_blood.var = gene_metadata
ad_bloodAnnData object with n_obs × n_vars = 486 × 33660
obs: 'cell', 'nGene', 'nUMI', 'orig.ident', 'percent_MT', 'fetal_ID', 'sort_ID', 'tissue.id', 'plate.id', 'lanes', 'stages', 'sample.type', 'gender', 'cell.label', 'doublets', 'celltype'
var: 'gene'
After removing cells classified as doublets by the original authors, we filter out low-depth cells and genes expressed in less than 10 cells. We then set up a new layer named counts containing the raw counts - this is necessary for integration with scVI, which takes as input the raw, unnormalized expression estimates. Lastly, we identify 3,000 HVGs and subset our object to just those genes.
ad_blood = ad_blood[ad_blood.obs['doublets'] == 'Singlet']
sc.pp.filter_cells(ad_blood, min_counts=200)
sc.pp.filter_genes(ad_blood, min_cells=10)
ad_blood.layers['counts'] = ad_blood.X.copy()
ad_blood.raw = ad_blood
sc.pp.highly_variable_genes(
ad_blood,
n_top_genes=3000,
flavor='seurat_v3',
layer='counts',
subset=True
)scVIscVI modelWe’ll integrate across the ID specifying which fetus the cells came from. We also make scVI aware of our celltypes, and instruct it to regress out variation associated with the percentage of mitochondrial reads.
We allow gene dispersion estimates to vary by celltype, and specify expression as following a negative-binomial distribution.
Finally we train a variational autoencoder (VAE) over 250 epochs to embed the cells in 20-dimensional latent space.
We use the cosine distance to identify 20 NNs for each cell in the latent scVI space. This neighbor graph will serve as the basis for our nonlinear embeddings.
Using default parameters, we fit a 2D UMAP embedding.
The UMAP embedding is a bit oddly-shaped, but celltypes are well-connected and their progression follows known biology i.e., HSCs form precursor populations which develop into mature monocytic cells.
With the Fruchterman-Reingold algorithm we generate a force-directed graph embedding in 2 dimensions. In my experience this algorithm often works better than UMAP at preserving trajectory structures.
Similar to the UMAP embedding, the FR embedding is well-connected and the celltypes are arranged correctly. I like this embedding better, but that’s mostly based on aesthetics as well as some intuition about how pseudotime estimation would perform on each.
Lastly, we estimate a diffusion map embedding in 15 dimensions. This algorithm is specifically designed to preserve transitional structures, but in my experience it usually only works well on datasets with very simple trajectories.
Like the UMAP embedding the diffusion map embedding is pretty angular, but it recapitulates the biology well. Altogether, the scVI integration seems to have worked correctly as all embedding algorithms perform reasonably well.
HarmonyThe Harmony algorithm works by “correcting” a principal component embedding for batch effects. As such, we need to first normalize and variance-stabilize the data.
Next we scale the normalized counts and run PCA.
The PCA embedding on its own is alright, and I don’t imagine the integration procedure will change it much.
Harmony2023-12-08 17:54:12,890 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans...
/Users/jack/Desktop/PhD/Research/Python_Envs/personal_site/lib/python3.11/site-packages/threadpoolctl.py:1019: RuntimeWarning: libc not found. The ctypes module in Python 3.11 is maybe too old for this OS.
warnings.warn(
2023-12-08 17:54:12,945 - harmonypy - INFO - sklearn.KMeans initialization complete.
2023-12-08 17:54:12,946 - harmonypy - INFO - Iteration 1 of 10
2023-12-08 17:54:12,992 - harmonypy - INFO - Iteration 2 of 10
2023-12-08 17:54:13,037 - harmonypy - INFO - Iteration 3 of 10
2023-12-08 17:54:13,081 - harmonypy - INFO - Iteration 4 of 10
2023-12-08 17:54:13,127 - harmonypy - INFO - Iteration 5 of 10
2023-12-08 17:54:13,171 - harmonypy - INFO - Converged after 5 iterations
As expected, the corrected PCA space is pretty similar to the original one.
We compute \(k = 20\) NNs in the Harmony PCA space.
Uh oh - the UMAP embedding displays disconnected clusters of cells. This indicates that the integration didn’t perform very well, though other dimension reduction algorithms might perform better.
The FR embedding shows a disconnected cluster of monocytes as well. In addition, the progression from HSCs to progenitor / mature states is not well-represented.
The diffusion map embedding is also fairly disconnected and has a tiny outlier cluster.
BBKNNNext we try the BBKNN algorithm which, instead of producing an integrated embedding (as do the other methods) produces a batch-corrected neighborhood graph. We can then use that graph structure to generate UMAP and other embeddings.
The UMAP is somewhat smoothly-connected, but the biological sequence of celltypes isn’t correct.
The FR embedding suffers from the same issue.
The diffusion map embedding also misplaces the monocyte phenotype.
ScanoramaThe last method we’ll try is Scanorama. Like scVi and Harmony this algorithm produces a corrected low-dimensional embedding.
We identify \(k = 20\) NNs in the integrated space.
With UMAP the celltype progression is represented correctly, though the transition from the monocyte-macrophage phenotype to the mature Kupffer cells is a bit wonky.
The FR embedding is a bit worse and isn’t very smoothly connected.
The diffusion map embedding is mostly OK, but there’s a weird outlier cluster of monocytes located at the minimum of the first component.
Overall, it seems that the scVI latent representation of the data performed the best. As for embeddings of that integrated space, I would probably choose the force-directed graph layout, though UMAP also performed well. The primary drawback of this method isn’t readily apparent with this dataset - computation time. I’ve experienced long runtimes (1h+) on even medium-sized datasets of around 25-40k cells. This can make it difficult to experiment and tweak the algorithm’s parameters, but overall it tends to perform well without much tuning. As for non-deep learning methods, I would subjectively say that Scanorama provided the second-best embeddings, though I’ve also had good experiences with the R implementation of Harmony. As always, it’s important to try multiple methods and compare results both visually and quantitatively.
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os macOS Big Sur ... 10.16
system x86_64, darwin17.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2023-12-08
pandoc 3.1.9 @ /usr/local/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.2.0)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.0)
cluster 2.1.4 2022-08-22 [1] CRAN (R 4.2.0)
codetools 0.2-18 2020-11-04 [1] CRAN (R 4.2.1)
colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
cowplot 1.1.1 2020-12-30 [1] CRAN (R 4.2.0)
data.table 1.14.2 2021-09-27 [1] CRAN (R 4.2.0)
deldir 1.0-6 2021-10-23 [1] CRAN (R 4.2.0)
digest 0.6.33 2023-07-07 [1] CRAN (R 4.2.0)
dotCall64 1.0-1 2021-02-11 [1] CRAN (R 4.2.0)
dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.2.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
evaluate 0.23 2023-11-01 [1] CRAN (R 4.2.1)
fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
fastDummies 1.7.3 2023-07-06 [1] CRAN (R 4.2.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
fitdistrplus 1.1-8 2022-03-10 [1] CRAN (R 4.2.0)
future 1.33.0 2023-07-01 [1] CRAN (R 4.2.0)
future.apply 1.9.0 2022-04-25 [1] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
ggplot2 * 3.4.4 2023-10-12 [1] CRAN (R 4.2.0)
ggrepel 0.9.4 2023-10-13 [1] CRAN (R 4.2.0)
ggridges 0.5.3 2021-01-08 [1] CRAN (R 4.2.0)
globals 0.16.1 2022-08-28 [1] CRAN (R 4.2.1)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
goftest 1.2-3 2021-10-07 [1] CRAN (R 4.2.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.2.0)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.0)
here 1.0.1 2020-12-13 [1] CRAN (R 4.2.0)
htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
htmlwidgets 1.5.4 2021-09-08 [1] CRAN (R 4.2.0)
httpuv 1.6.5 2022-01-05 [1] CRAN (R 4.2.0)
httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
ica 1.0-3 2022-07-08 [1] CRAN (R 4.2.0)
igraph 1.5.1 2023-08-10 [1] CRAN (R 4.2.0)
irlba 2.3.5.1 2022-10-03 [1] CRAN (R 4.2.1)
jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.2.0)
KernSmooth 2.23-20 2021-05-03 [1] CRAN (R 4.2.1)
knitr 1.40 2022-08-24 [1] CRAN (R 4.2.0)
later 1.3.0 2021-08-18 [1] CRAN (R 4.2.0)
lattice 0.20-45 2021-09-22 [1] CRAN (R 4.2.1)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.2.0)
leiden 0.4.2 2022-05-09 [1] CRAN (R 4.2.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
listenv 0.8.0 2019-12-05 [1] CRAN (R 4.2.0)
lmtest 0.9-40 2022-03-21 [1] CRAN (R 4.2.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
MASS 7.3-60 2023-05-04 [1] CRAN (R 4.2.0)
Matrix * 1.6-3 2023-11-14 [1] CRAN (R 4.2.1)
matrixStats 0.62.0 2022-04-19 [1] CRAN (R 4.2.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
nlme 3.1-159 2022-08-09 [1] CRAN (R 4.2.0)
parallelly 1.36.0 2023-05-26 [1] CRAN (R 4.2.0)
patchwork 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
pbapply 1.5-0 2021-09-16 [1] CRAN (R 4.2.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
plotly 4.10.0 2021-10-09 [1] CRAN (R 4.2.0)
plyr 1.8.7 2022-03-24 [1] CRAN (R 4.2.0)
png 0.1-7 2013-12-03 [1] CRAN (R 4.2.0)
polyclip 1.10-0 2019-03-14 [1] CRAN (R 4.2.0)
progressr 0.10.1 2022-06-03 [1] CRAN (R 4.2.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
purrr 1.0.2 2023-08-10 [1] CRAN (R 4.2.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
RANN 2.6.1 2019-01-08 [1] CRAN (R 4.2.0)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.2.0)
Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.0)
RcppAnnoy 0.0.19 2021-07-30 [1] CRAN (R 4.2.0)
RcppHNSW 0.4.1 2022-07-18 [1] CRAN (R 4.2.0)
reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.2.0)
reticulate * 1.28 2023-01-27 [1] CRAN (R 4.2.0)
rlang 1.1.2 2023-11-04 [1] CRAN (R 4.2.1)
rmarkdown 2.16 2022-08-24 [1] CRAN (R 4.2.0)
ROCR 1.0-11 2020-05-02 [1] CRAN (R 4.2.0)
rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.0)
RSpectra 0.16-1 2022-04-24 [1] CRAN (R 4.2.0)
Rtsne 0.16 2022-04-17 [1] CRAN (R 4.2.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
scattermore 1.2 2023-06-12 [1] CRAN (R 4.2.0)
sctransform 0.4.1 2023-10-19 [1] CRAN (R 4.2.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
Seurat * 5.0.0 2023-11-04 [1] CRAN (R 4.2.1)
SeuratObject * 5.0.0 2023-10-26 [1] CRAN (R 4.2.1)
shiny 1.7.2 2022-07-19 [1] CRAN (R 4.2.0)
sp * 1.5-0 2022-06-05 [1] CRAN (R 4.2.0)
spam 2.9-1 2022-08-07 [1] CRAN (R 4.2.0)
spatstat.data 3.0-0 2022-10-21 [1] CRAN (R 4.2.0)
spatstat.explore 3.0-5 2022-11-10 [1] CRAN (R 4.2.1)
spatstat.geom 3.0-3 2022-10-25 [1] CRAN (R 4.2.0)
spatstat.random 3.0-1 2022-11-03 [1] CRAN (R 4.2.0)
spatstat.sparse 3.0-0 2022-10-21 [1] CRAN (R 4.2.0)
spatstat.utils 3.0-1 2022-10-19 [1] CRAN (R 4.2.0)
stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.0)
stringr 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
survival 3.4-0 2022-08-09 [1] CRAN (R 4.2.0)
tensor 1.5 2012-05-05 [1] CRAN (R 4.2.0)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.2.0)
tidyr 1.3.0 2023-01-24 [1] CRAN (R 4.2.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
uwot 0.1.16 2023-06-29 [1] CRAN (R 4.2.0)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.2.0)
viridisLite 0.4.1 2022-08-22 [1] CRAN (R 4.2.0)
withr 2.5.2 2023-10-30 [1] CRAN (R 4.2.1)
xfun 0.32 2022-08-10 [1] CRAN (R 4.2.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.0)
yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
zoo 1.8-10 2022-04-15 [1] CRAN (R 4.2.0)
[1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
─ Python configuration ───────────────────────────────────────────────────────
python: /Users/jack/Desktop/PhD/Research/Python_Envs/personal_site/bin/python
libpython: /usr/local/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/lib/python3.11/config-3.11-darwin/libpython3.11.dylib
pythonhome: /Users/jack/Desktop/PhD/Research/Python_Envs/personal_site:/Users/jack/Desktop/PhD/Research/Python_Envs/personal_site
version: 3.11.6 (main, Nov 2 2023, 04:52:24) [Clang 14.0.3 (clang-1403.0.22.14.1)]
numpy: /Users/jack/Desktop/PhD/Research/Python_Envs/personal_site/lib/python3.11/site-packages/numpy
numpy_version: 1.23.5
NOTE: Python version was forced by use_python function
──────────────────────────────────────────────────────────────────────────────
---
title: "Integrating scRNA-seq Datasets with Python"
author:
name: Jack Leary
email: j.leary@ufl.edu
orcid: 0009-0004-8821-3269
affiliations:
- name: University of Florida
department: Biostatistics
city: Gainesville
state: FL
date: today
format:
html:
code-fold: show
code-copy: true
code-tools: true
toc: true
embed-resources: true
fig-format: retina
df-print: kable
link-external-newwindow: true
fig-cap-location: bottom
fig-align: center
execute:
cache: false
freeze: auto
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE,
comment = NA,
fig.retina = TRUE)
reticulate::use_virtualenv("~/Desktop/PhD/Research/Python_Envs/personal_site/", required = TRUE)
```
# Introduction
The need to integrate data from scRNA-seq samples into one harmonized dataset has increased in recent years as single cell sequencing has gotten cheaper, making it easier to collect data from multiple subjects, timepoints, or conditions. The goal of most integration techniques is to create an embedding that is (relatively) free of batch effects and that does not lead to the cells clustering by subject ID. There are many methods available to perform integration in both R & Python, and evaluating which method is "best" for your dataset can be tricky & subjective. In this tutorial we'll try out several of the integration methods available in the `scanpy` package and compare their results. We'll use a SmartSeq2 dataset containing myeloid cells from the developing human fetal liver that was originally analyzed in [Popescu *et al* (2020)](https://doi.org/10.1038/s41586-019-1652-y). As such, our goal will be to determine which integration method results in the best embedding for downstream trajectory analysis. In developmental biology settings, a "good" scRNA-seq integrated embedding is smoothly connected (i.e., no spaced-out, discrete clusters), and places celltypes in roughly the order expected based on known biology.
# Libraries
## R
```{r, message=FALSE}
library(dplyr) # dataframe tools
library(Matrix) # sparse matrices
library(Seurat) # scRNA-seq tools
library(ggplot2) # plots
library(reticulate) # Python interface
```
## Python
```{python}
import scvi # VAE integration
import warnings # filter out warnings
import numpy as np # matrix utilities
import scanpy as sc # scRNA-seq processing
import pandas as pd # dataframe tools
import anndata as ad # scRNA-seq data structures
import matplotlib.pyplot as plt # plot utilities
from scipy.sparse import csr_matrix # sparse matrices
```
```{python}
warnings.simplefilter('ignore', category=UserWarning)
```
# Theme for `matplotlib`
Here we define a theme for `matplotlib` that mostly matches `ggplot2::theme_classic()`.
```{python}
#| code-fold: true
base_size = 12
plt.rcParams.update({
# font
'font.size': base_size,
'font.weight': 'normal',
# figure
'figure.dpi': 300,
'figure.edgecolor': 'white',
'figure.facecolor': 'white',
'figure.figsize': (6, 4),
'figure.constrained_layout.use': True,
# axes
'axes.edgecolor': 'black',
'axes.grid': False,
'axes.labelpad': 2.75,
'axes.labelsize': base_size * 0.8,
'axes.linewidth': 1.5,
'axes.spines.right': False,
'axes.spines.top': False,
'axes.titlelocation': 'left',
'axes.titlepad': 11,
'axes.titlesize': base_size,
'axes.titleweight': 'normal',
'axes.xmargin': 0.1,
'axes.ymargin': 0.1,
# legend
'legend.borderaxespad': 1,
'legend.borderpad': 0.5,
'legend.columnspacing': 2,
'legend.fontsize': base_size * 0.8,
'legend.frameon': False,
'legend.handleheight': 1,
'legend.handlelength': 1.2,
'legend.labelspacing': 1,
'legend.title_fontsize': base_size,
'legend.markerscale': 1.25
})
```
# Data
We'll start by reading in the SmartSeq2 scRNA-seq data from [Popescu *et al*](https://doi.org/10.1038/s41586-019-1652-y), which we downloaded from [the Human Developmental Cell Atlas portal](https://developmental.cellatlas.io/fetal-liver). Some massaging of the data is necessary to get it ready to pass into Python.
```{r}
seu_blood <- readRDS("../../datasets/fetal_liver_SS2.RDS")
class(seu_blood) <- "Seurat"
blood_counts <- as.matrix(seu_blood@raw.data)
blood_counts <- blood_counts[, colnames(seu_blood@scale.data)]
cell_metadata <- seu_blood@meta.data[colnames(seu_blood@scale.data), ] %>%
mutate(cell = rownames(.),
.before = 1)
gene_metadata <- data.frame(gene = rownames(blood_counts))
```
# Read data into Python
Using the `reticulate` R package we transfer our raw counts matrix and cell & gene metadata into Python, then create an `AnnData` object. Lastly, we rename some cell metadata features and subset to just the macrophage development lineage. This will simplify our trajectory structure and make choosing a good integration / embedding combination easier.
```{python}
blood_counts = csr_matrix(r.blood_counts.transpose())
cell_metadata = r.cell_metadata
gene_metadata = r.gene_metadata
ad_blood = ad.AnnData(blood_counts)
ad_blood.obs_names = cell_metadata['cell']
ad_blood.var_names = gene_metadata['gene']
ad_blood.obs = cell_metadata
ad_blood.obs.rename(columns={'cell.labels': 'celltype', 'fetal.ids': 'fetal_ID', 'percent.mito': 'percent_MT', 'sort.ids': 'sort_ID'}, inplace=True)
ad_blood = ad_blood[ad_blood.obs['celltype'].isin(['HSC_MPP', 'Neutrophil-myeloid progenitor', 'Monocyte precursor', 'Monocyte', 'Mono-Mac', 'Kupffer Cell'])]
ad_blood.obs['celltype'] = (
ad_blood.obs['celltype']
.map(lambda x: {'HSC_MPP': 'HSC', 'Mono-Mac': 'Monocyte-macrophage', 'Kupffer Cell': 'Kupffer cell'}.get(x, x))
.astype('category')
)
ad_blood.var = gene_metadata
ad_blood
```
# Preprocessing
After removing cells classified as doublets by the original authors, we filter out low-depth cells and genes expressed in less than 10 cells. We then set up a new layer named **counts** containing the raw counts - this is necessary for integration with `scVI`, which takes as input the raw, unnormalized expression estimates. Lastly, we identify 3,000 HVGs and subset our object to just those genes.
```{python}
ad_blood = ad_blood[ad_blood.obs['doublets'] == 'Singlet']
sc.pp.filter_cells(ad_blood, min_counts=200)
sc.pp.filter_genes(ad_blood, min_cells=10)
ad_blood.layers['counts'] = ad_blood.X.copy()
ad_blood.raw = ad_blood
sc.pp.highly_variable_genes(
ad_blood,
n_top_genes=3000,
flavor='seurat_v3',
layer='counts',
subset=True
)
```
# Integration with `scVI`
[Paper](https://doi.org/10.1038/s41592-018-0229-2), [Docs](https://scvi-tools.org)
## Train `scVI` model
We'll integrate across the ID specifying which fetus the cells came from. We also make `scVI` aware of our celltypes, and instruct it to regress out variation associated with the percentage of mitochondrial reads.
```{python}
#| results: hide
scvi.settings.verbosity = 0
scvi.settings.seed = 312
scvi.settings.num_threads = 4
scvi.model.SCVI.setup_anndata(
ad_blood,
layer='counts',
batch_key='fetal_ID',
labels_key='celltype',
continuous_covariate_keys=['percent_MT']
)
```
We allow gene dispersion estimates to vary by celltype, and specify expression as following a negative-binomial distribution.
```{python}
int_model = scvi.model.SCVI(
ad_blood,
n_layers=3,
n_hidden=96,
n_latent=20,
gene_likelihood='nb',
dispersion='gene-label'
)
```
Finally we train a variational autoencoder (VAE) over 250 epochs to embed the cells in 20-dimensional latent space.
```{python}
#| results: hide
int_model.train(
early_stopping=True,
accelerator='cpu',
max_epochs=250,
train_size=0.8
)
ad_blood.obsm['X_scVI'] = int_model.get_latent_representation()
```
```{python}
#| code-fold: true
#| fig-cap: scVI embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-scvi_embed
sc.pl.embedding(
ad_blood,
basis='scVI',
color='celltype',
title='',
frameon=True,
size=30,
alpha=0.75,
show=False
)
plt.gca().set_xlabel('scVI 1')
plt.gca().set_ylabel('scVI 2')
plt.show()
```
## SNN graph estimation
We use the cosine distance to identify 20 NNs for each cell in the latent `scVI` space. This neighbor graph will serve as the basis for our nonlinear embeddings.
```{python}
sc.pp.neighbors(
ad_blood,
n_neighbors=20,
n_pcs=None,
metric='cosine',
random_state=312,
use_rep='X_scVI'
)
```
## UMAP embedding
Using default parameters, we fit a 2D UMAP embedding.
```{python}
sc.tl.umap(ad_blood, random_state=312)
```
The UMAP embedding is a bit oddly-shaped, but celltypes are well-connected and their progression follows known biology i.e., HSCs form precursor populations which develop into mature monocytic cells.
```{python}
#| code-fold: true
#| fig-cap: scVI-based UMAP embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-umap_embed_scvi
sc.pl.embedding(
ad_blood,
basis='umap',
color='celltype',
title='',
frameon=True,
size=30,
alpha=0.75,
show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
```
## Force-directed graph embedding
With the Fruchterman-Reingold algorithm we generate a force-directed graph embedding in 2 dimensions. In my experience this algorithm often works better than UMAP at preserving trajectory structures.
```{python}
sc.tl.draw_graph(
ad_blood,
layout='fr',
random_state=312,
n_jobs=2
)
```
Similar to the UMAP embedding, the FR embedding is well-connected and the celltypes are arranged correctly. I like this embedding better, but that's mostly based on aesthetics as well as some intuition about how pseudotime estimation would perform on each.
```{python}
#| code-fold: true
#| fig-cap: scVI-based force-directed graph embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-graph_embed_scvi
sc.pl.draw_graph(
ad_blood,
color='celltype',
title='',
alpha=0.75,
size=30,
show=False
)
plt.gca().set_xlabel('FR 1')
plt.gca().set_ylabel('FR 2')
plt.show()
```
## Diffusion map embedding
Lastly, we estimate a diffusion map embedding in 15 dimensions. This algorithm is specifically designed to preserve transitional structures, but in my experience it usually only works well on datasets with very simple trajectories.
```{python}
sc.tl.diffmap(
ad_blood,
random_state=312,
n_comps=16
)
ad_blood.obsm['X_diffmap_old'] = ad_blood.obsm['X_diffmap']
ad_blood.obsm['X_diffmap'] = ad_blood.obsm['X_diffmap'][:, 1:]
```
Like the UMAP embedding the diffusion map embedding is pretty angular, but it recapitulates the biology well. Altogether, the `scVI` integration seems to have worked correctly as all embedding algorithms perform reasonably well.
```{python}
#| code-fold: true
#| fig-cap: scVI-based diffusion map embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-diffmap_embed_scvi
sc.pl.diffmap(
ad_blood,
color='celltype',
title='',
alpha=0.75,
size=30,
show=False
)
plt.gca().set_xlabel('DC 1')
plt.gca().set_ylabel('DC 2')
plt.show()
```
# Integration with `Harmony`
[Paper](https://doi.org/10.1038/s41592-019-0619-0), [Docs](https://github.com/slowkow/harmonypy)
## Normalization
The `Harmony` algorithm works by "correcting" a principal component embedding for batch effects. As such, we need to first normalize and variance-stabilize the data.
```{python}
sc.pp.normalize_total(ad_blood, target_sum=1e4)
sc.pp.log1p(ad_blood)
```
## PCA embedding
Next we scale the normalized counts and run PCA.
```{python}
sc.pp.scale(ad_blood)
sc.tl.pca(
ad_blood,
n_comps=30,
random_state=312,
use_highly_variable=True
)
```
The PCA embedding on its own is alright, and I don't imagine the integration procedure will change it much.
```{python}
#| code-fold: true
#| fig-cap: PCA embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-pca_embed
sc.pl.embedding(
ad_blood,
basis='pca',
color='celltype',
title='',
frameon=True,
size=30,
alpha=0.75,
show=False
)
plt.gca().set_xlabel('PC 1')
plt.gca().set_ylabel('PC 2')
plt.show()
```
## PCA correction with `Harmony`
```{python}
sc.external.pp.harmony_integrate(ad_blood, key='fetal_ID')
```
As expected, the corrected PCA space is pretty similar to the original one.
```{python}
#| code-fold: true
#| fig-cap: Harmony embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-harmony_embed
sc.pl.embedding(
ad_blood,
basis='pca_harmony',
color='celltype',
title='',
frameon=True,
size=30,
alpha=0.75,
show=False
)
plt.gca().set_xlabel('Harmony 1')
plt.gca().set_ylabel('Harmony 2')
plt.show()
```
## SNN graph estimation
We compute $k = 20$ NNs in the `Harmony` PCA space.
```{python}
sc.pp.neighbors(
ad_blood,
n_neighbors=20,
n_pcs=None,
metric='cosine',
random_state=312,
use_rep='X_pca_harmony'
)
```
## UMAP embedding
```{python}
sc.tl.umap(ad_blood, random_state=312)
```
Uh oh - the UMAP embedding displays disconnected clusters of cells. This indicates that the integration didn't perform very well, though other dimension reduction algorithms might perform better.
```{python}
#| code-fold: true
#| fig-cap: Harmony-based UMAP embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-umap_embed_harmony
sc.pl.embedding(
ad_blood,
basis='umap',
color='celltype',
title='',
frameon=True,
size=30,
alpha=0.75,
show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
```
## Force-directed graph embedding
```{python}
sc.tl.draw_graph(
ad_blood,
layout='fr',
random_state=312,
n_jobs=2
)
```
The FR embedding shows a disconnected cluster of monocytes as well. In addition, the progression from HSCs to progenitor / mature states is not well-represented.
```{python}
#| code-fold: true
#| fig-cap: Harmony-based force-directed graph embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-graph_embed_harmony
sc.pl.draw_graph(
ad_blood,
color='celltype',
title='',
alpha=0.75,
size=30,
show=False
)
plt.gca().set_xlabel('FR 1')
plt.gca().set_ylabel('FR 2')
plt.show()
```
## Diffusion map embedding
```{python}
sc.tl.diffmap(
ad_blood,
random_state=312,
n_comps=16
)
ad_blood.obsm['X_diffmap_old'] = ad_blood.obsm['X_diffmap']
ad_blood.obsm['X_diffmap'] = ad_blood.obsm['X_diffmap'][:, 1:]
```
The diffusion map embedding is also fairly disconnected and has a tiny outlier cluster.
```{python}
#| code-fold: true
#| fig-cap: Harmony-based diffusion map embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-diffmap_embed_harmony
sc.pl.diffmap(
ad_blood,
color='celltype',
title='',
alpha=0.75,
size=30,
show=False
)
plt.gca().set_xlabel('DC 1')
plt.gca().set_ylabel('DC 2')
plt.show()
```
# Integration with `BBKNN`
[Paper](https://doi.org/10.1093/bioinformatics/btz625), [Docs](https://github.com/Teichlab/bbknn)
## Batch-specific SNN estimation
Next we try the `BBKNN` algorithm which, instead of producing an integrated embedding (as do the other methods) produces a batch-corrected neighborhood graph. We can then use that graph structure to generate UMAP and other embeddings.
```{python}
sc.external.pp.bbknn(
ad_blood,
batch_key='fetal_ID',
use_annoy=False,
metric='cosine',
pynndescent_random_state=312,
pynndescent_n_neighbors=20
)
```
## UMAP embedding
```{python}
sc.tl.umap(ad_blood, random_state=312)
```
The UMAP is somewhat smoothly-connected, but the biological sequence of celltypes isn't correct.
```{python}
#| code-fold: true
#| fig-cap: BBKNN-based UMAP embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-umap_embed_bbknn
sc.pl.embedding(
ad_blood,
basis='umap',
color='celltype',
title='',
frameon=True,
size=30,
alpha=0.75,
show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
```
## Force-directed graph embedding
```{python}
sc.tl.draw_graph(
ad_blood,
layout='fr',
random_state=312,
n_jobs=2
)
```
The FR embedding suffers from the same issue.
```{python}
#| code-fold: true
#| fig-cap: BBKNN-based force-directed graph embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-graph_embed_bbknn
sc.pl.draw_graph(
ad_blood,
color='celltype',
title='',
alpha=0.75,
size=30,
show=False
)
plt.gca().set_xlabel('FR 1')
plt.gca().set_ylabel('FR 2')
plt.show()
```
## Diffusion map embedding
```{python}
sc.tl.diffmap(
ad_blood,
random_state=312,
n_comps=16
)
ad_blood.obsm['X_diffmap_old'] = ad_blood.obsm['X_diffmap']
ad_blood.obsm['X_diffmap'] = ad_blood.obsm['X_diffmap'][:, 1:]
```
The diffusion map embedding also misplaces the monocyte phenotype.
```{python}
#| code-fold: true
#| fig-cap: BBKNN-based diffusion map embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-diffmap_embed_bbknn
sc.pl.diffmap(
ad_blood,
color='celltype',
title='',
alpha=0.75,
size=30,
show=False
)
plt.gca().set_xlabel('DC 1')
plt.gca().set_ylabel('DC 2')
plt.show()
```
# Integration with `Scanorama`
[Paper](https://doi.org/10.1038/s41587-019-0113-3), [Docs](https://github.com/brianhie/scanorama)
## Batch-correcting integration
The last method we'll try is `Scanorama`. Like `scVi` and `Harmony` this algorithm produces a corrected low-dimensional embedding.
```{python}
sc.external.pp.scanorama_integrate(
ad_blood,
key='fetal_ID',
basis='X_pca',
knn=20
)
```
## SNN graph estimation
We identify $k = 20$ NNs in the integrated space.
```{python}
sc.pp.neighbors(
ad_blood,
n_neighbors=20,
n_pcs=None,
metric='cosine',
random_state=312,
use_rep='X_scanorama'
)
```
## UMAP embedding
```{python}
sc.tl.umap(ad_blood, random_state=312)
```
With UMAP the celltype progression is represented correctly, though the transition from the monocyte-macrophage phenotype to the mature Kupffer cells is a bit wonky.
```{python}
#| code-fold: true
#| fig-cap: Scanorama-based UMAP embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-umap_embed_scanorama
sc.pl.embedding(
ad_blood,
basis='umap',
color='celltype',
title='',
frameon=True,
size=30,
alpha=0.75,
show=False
)
plt.gca().set_xlabel('UMAP 1')
plt.gca().set_ylabel('UMAP 2')
plt.show()
```
## Force-directed graph embedding
```{python}
sc.tl.draw_graph(
ad_blood,
layout='fr',
random_state=312,
n_jobs=2
)
```
The FR embedding is a bit worse and isn't very smoothly connected.
```{python}
#| code-fold: true
#| fig-cap: Harmony-based force-directed graph embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-graph_embed_scanorama
sc.pl.draw_graph(
ad_blood,
color='celltype',
title='',
alpha=0.75,
size=30,
show=False
)
plt.gca().set_xlabel('FR 1')
plt.gca().set_ylabel('FR 2')
plt.show()
```
## Diffusion map embedding
```{python}
sc.tl.diffmap(
ad_blood,
random_state=312,
n_comps=16
)
ad_blood.obsm['X_diffmap_old'] = ad_blood.obsm['X_diffmap']
ad_blood.obsm['X_diffmap'] = ad_blood.obsm['X_diffmap'][:, 1:]
```
The diffusion map embedding is mostly OK, but there's a weird outlier cluster of monocytes located at the minimum of the first component.
```{python}
#| code-fold: true
#| fig-cap: Scanorama-based diffusion map embedding
#| fig-width: 6
#| fig-height: 4
#| label: fig-diffmap_embed_scanorama
sc.pl.diffmap(
ad_blood,
color='celltype',
title='',
alpha=0.75,
size=30,
show=False
)
plt.gca().set_xlabel('DC 1')
plt.gca().set_ylabel('DC 2')
plt.show()
```
# Conclusions
Overall, it seems that the `scVI` latent representation of the data performed the best. As for embeddings of that integrated space, I would probably choose the force-directed graph layout, though UMAP also performed well. The primary drawback of this method isn't readily apparent with this dataset - computation time. I've experienced long runtimes (1h+) on even medium-sized datasets of around 25-40k cells. This can make it difficult to experiment and tweak the algorithm's parameters, but overall it tends to perform well without much tuning. As for non-deep learning methods, I would subjectively say that `Scanorama` provided the second-best embeddings, though I've also had good experiences with the R implementation of `Harmony`. As always, it's important to try multiple methods and compare results both visually and quantitatively.
# Session info
```{r}
sessioninfo::session_info()
```